Skip to content

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Oct 15, 2025

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

While testing the arrow 57 upgrade in DataFusion I found a few things that need to be fixed
in parquet-rs.

One was that the method ArrowWriter::into_serialized_writer was deprecated, (which I know I suggested in #8389 🤦 ). However, when testing it turns out that the constructor of SerializedFileWriter does a lot of work (like creating the parquet schema from the arrow schema and messing with metadata)

let mut props = options.properties;
let schema = if let Some(parquet_schema) = options.schema_descr {
parquet_schema.clone()
} else {
let mut converter = ArrowSchemaConverter::new().with_coerce_types(props.coerce_types());
if let Some(schema_root) = &options.schema_root {
converter = converter.schema_root(schema_root);
}
converter.convert(&arrow_schema)?
};
if !options.skip_arrow_metadata {
// add serialized arrow schema
add_encoded_arrow_schema_to_metadata(&arrow_schema, &mut props);
}
let max_row_group_size = props.max_row_group_size();
let props_ptr = Arc::new(props);
let file_writer =
SerializedFileWriter::new(writer, schema.root_schema_ptr(), Arc::clone(&props_ptr))?;
let row_group_writer_factory =
ArrowRowGroupWriterFactory::new(&file_writer, arrow_schema.clone());
Ok(Self {
writer: file_writer,
in_progress: None,
arrow_schema,
row_group_writer_factory,
max_row_group_size,
})

Creating a RowGroupWriterFactory directly would involve a bunch of code duplication

What changes are included in this PR?

So let's not deprecate this method for now and instead add some additional docs to guide people to the right lace

Are these changes tested?

I tested manually upstream

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 15, 2025
/// Flushes any outstanding data before returning.
///
/// This can be useful to provide more control over how files are written, for example
/// to write columns in parallel. See the example on [`ArrowColumnWriter`].
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to drop a few breadcrumbs to find @adamreeve 's new example in #8582

@alamb alamb marked this pull request as ready for review October 15, 2025 13:01
@mbrobbel mbrobbel added this to the 57.0.0 milestone Oct 15, 2025
///
/// You can create this structure via an [`ArrowRowGroupWriterFactory`]
///
/// See the example on [`ArrowColumnWriter`] for how to encode columns in parallel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrowRowGroupWriter is actually not public and not used in the ArrowColumnWriter example, so this last sentence might be a little confusing.

(ArrowRowGroupWriterFactory is internally used to create ArrowRowGroupWriters, but publicly it only exposes the ability to create a Vec<ArrowColumnWriter>. If I was starting this again from scratch I might have named things a little differently...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks -- I tried to improve the comments in 619210d

@alamb
Copy link
Contributor Author

alamb commented Oct 16, 2025

Thank you @adamreeve and @mbrobbel

@alamb alamb merged commit 5a384f4 into apache:main Oct 16, 2025
16 checks passed
@alamb alamb deleted the alamb/into_serialized_writer branch October 16, 2025 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants